SQL Server 2008 High Availability : The Fundamentals of HA

7/29/2011 5:50:54 PM

What’s New in High Availability

In general, a couple of Microsoft SQL Server 2008 configuration options offer a very strong database engine foundation that can be highly available (7 days a week, 365 days a year). Microsoft’s sights are set on being able to achieve five-nines reliability with almost everything it builds. An internal breakthrough introduced with SQL Server 2005, called “copy-on-write” technology, has enabled Microsoft to greatly enhance several of its database high availability options.

Here are a few of the most significant enhancements and new features that have direct or indirect effects on increasing high availability for a SQL Server 2008–based implementation:

Increased number of nodes in a SQL cluster— You can create a SQL cluster of up to 16 nodes on Windows Server 2008 Datacenter.
Enhancements for unattended cluster setup— Instead of having to use wizards to set up SQL clustering, you can use the Unattended Cluster Setup mode. This is very useful for fast re-creation or remote creation of SQL clustering configurations.
All SQL Server 2008 services as cluster managed resources— All SQL Server 2008 services are now cluster aware.
SQL Server 2008 database mirroring— Database mirroring creates an automatic failover capability to a “hot” standby server.
SQL Server 2008 peer-to-peer replication— This option of data replication uses a publisher-to-publisher model (hence peer-to-peer).
SQL Server 2008 automatic corruption recovery from mirror— This enhancement in database mirroring recognizes and corrects corrupt pages during mirroring.
SQL Server 2008 mirroring transaction record compression— This feature allows for compression of the transaction log records used in database mirroring to increase the speed of transmission to the mirror.
SQL Server 2008 fast recovery— Administrators can reconnect to a recovering database after the transaction log has been rolled forward (and before the rollback processing has finished).
Online restore— Database administrators can perform a restore operation while the database is still online.
Online indexing— The online index option allows concurrent modifications (updates, deletes, and inserts) to the underlying table or clustered index data and any associated indexes during index creation time.
Database snapshot— SQL Server 2008 allows for the generation and use of a read-only, stable view of a database. The database snapshot is created without the overhead of creating a complete copy of the database or having completely redundant storage.
Hot additions— This feature allows for hot additions to memory and CPU.
Addition of a snapshot isolation level— A new snapshot isolation (SI) level is provided at the database level. With SI, users can access the last committed row, using a transactionally consistent view of the database.
Dedicated administrator connection— SQL Server 2008 supports a dedicated administrator connection that administrators can use to access a running server even if the server is locked or otherwise unavailable. This capability enables administrators to troubleshoot problems on a server by executing diagnostic functions or Transact-SQL statements without having to take down the server.

At the operating system (OS) level, Virtual Server 2005 has firmly established virtualization for both development and production environments and allows entire application and database stacks to run on a completely virtual operating system footprint that will never bring down the physical server.

Note

Microsoft has announced that log shipping will be deprecated soon. Although it has been functionally replaced with database mirroring, log shipping remains available in SQL Server 2008. However, you should plan to move off log shipping as soon as you can.

Keep in mind that Microsoft already has an extensive capability in support of high availability. The new HA features add significant gains to the already feature-rich offering.

What Is High Availability?

The availability continuum depicted in Figure 1 shows a general classification of availability based on the amount of downtime an application can tolerate without impacting the business. You would write your service-level agreements (SLAs) to support and try to achieve one of these continuum categories.

Figure 1. Availability continuum.

Topping the chart is the category extreme availability, so named to indicate that this is the least tolerant category and is essentially a zero (or near zero) downtime requirement (that is, sustained 99.5% to 100% availability). The mythical five-nines falls at the high end of this category. Next is the high availability category, which has a minimal tolerance for downtime (that is, sustained 95% to 99.4% availability). Most “critical” applications would fit into this category of availability need. Then comes the standard availability category, with a more normal type of operation (that is, sustained 83% to 94% availability). The acceptable availability category is for applications that are deemed noncritical to a company’s business, such as online employee benefit package self-service applications. These applications can tolerate much lower availability ranges (sustained 70% to 82% availability) than the more critical services. Finally, the marginal availability category is for nonproduction custom applications, such as marketing mailing label applications that can tolerate significant downtime (that is, sustained 0% to 69% availability). Again, remember that availability is measured by the planned operation times of the application.
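The continuum above can be sketched as a small lookup. This is only an illustration: the function names are hypothetical, the category bounds are copied from the ranges in the text, and values that fall between two stated ranges (for example, 99.45%) are assigned to the lower category, which is an assumption. Availability is measured against planned operation time, as the text describes.

```python
# Lower bounds (in percent) for each category, taken from the continuum above.
# Assumption: a value between two stated ranges falls into the lower category.
CATEGORIES = [
    (99.5, "extreme availability"),
    (95.0, "high availability"),
    (83.0, "standard availability"),
    (70.0, "acceptable availability"),
    (0.0, "marginal availability"),
]


def measured_availability(planned_hours, downtime_hours):
    """Return percent availability relative to planned operation time."""
    if planned_hours <= 0:
        raise ValueError("planned_hours must be positive")
    if not 0 <= downtime_hours <= planned_hours:
        raise ValueError("downtime_hours must be between 0 and planned_hours")
    return 100.0 * (planned_hours - downtime_hours) / planned_hours


def classify(availability_pct):
    """Map a percent availability onto a continuum category name."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    for lower_bound, name in CATEGORIES:
        if availability_pct >= lower_bound:
            return name
    raise AssertionError("unreachable: 0.0 is the lowest bound")
```

For example, an application planned to run 100 hours that was down for 1 of them measures 99% availability, which `classify` places in the high availability category.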

Achieving the mythical five-nines (that is, a sustained 99.999% availability) falls into the extreme availability category (which tolerates between 5.26 minutes and 1.8 days of downtime per year). In general, the computer industry calls this high availability, but we push this type of near-zero downtime requirement into its own extreme category, all by itself. Most applications can only dream about this level of availability because of the costs involved, the high level of operational support required, the specialized hardware that must be in place, and many other extreme factors.
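The downtime figures quoted for the extreme availability category can be reproduced with a short calculation. The sketch below assumes an application that is planned to run around the clock for a 365-day year; the function name is hypothetical.

```python
# Assumption: planned operation time is 24 hours a day for a 365-day year.
MINUTES_PER_DAY = 24 * 60
MINUTES_PER_YEAR = 365 * MINUTES_PER_DAY


def downtime_minutes_per_year(availability_pct):
    """Return the yearly downtime (in minutes) allowed at a given availability."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    return (1.0 - availability_pct / 100.0) * MINUTES_PER_YEAR


five_nines = downtime_minutes_per_year(99.999)                  # about 5.26 minutes
category_floor = downtime_minutes_per_year(99.5) / MINUTES_PER_DAY  # about 1.8 days
```

If an application is only planned to run during business hours, `MINUTES_PER_YEAR` would shrink accordingly and the allowed downtime with it.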

The Fundamentals of HA

Every minute of downtime you have today translates into losses that you can ill afford. You must fully understand how the hardware and software components work together and how, if one component fails, the others will be affected. High availability of an application is a function of all the components together, not just one by itself. Therefore, the best approach for moving into supporting high availability is to work on shoring up the basic foundation components of hardware, backup/recovery, operating system upgrades, ample vendor agreements, sufficient training, extensive quality assurance/testing, rigorous standards and procedures, and some overall risk-mitigating strategies, such as spreading out critical applications over multiple servers. By addressing these first, you add a significant amount of stability and high-availability capability across your hardware/system stack. In other words, you are moving up to a necessary level before you completely jump into a particular high-availability solution. If you do nothing further from this point, you have already achieved a portion of your high-availability goals.
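One common way to see why availability is a function of all the components together is to multiply the availabilities of every component the application depends on. The sketch below assumes that the components fail independently and that the application is down whenever any one of them is down; the component names and the 99.9% figures are hypothetical.

```python
def stack_availability(component_availabilities):
    """Return the combined availability of components that are all required.

    Each value is a fraction between 0 and 1. Assumes independent failures.
    """
    combined = 1.0
    for name, availability in component_availabilities.items():
        if not 0.0 <= availability <= 1.0:
            raise ValueError(f"{name}: availability must be between 0 and 1")
        combined *= availability
    return combined


# Hypothetical stack: five components, each individually at 99.9%.
stack = {
    "hardware": 0.999,
    "operating_system": 0.999,
    "network": 0.999,
    "sql_server": 0.999,
    "application": 0.999,
}
overall = stack_availability(stack)  # roughly 0.995, lower than any single part
```

Under these assumptions the stack as a whole is less available than its weakest component, which is why shoring up every foundation layer matters before choosing a specific high-availability solution.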

Hardware Factors

You need to start by addressing your basic hardware issues for high availability and fault tolerance. This includes redundant power supplies, UPS systems, redundant network connections, and error-correcting code (ECC) memory. Also available are “hot-swappable” components, such as disks, CPUs, and memory. In addition, most servers are now using multiple CPUs, fault-tolerant disk systems such as RAID, mirrored disks, storage area networks (SANs), Network Attached Storage (NAS), redundant fans, and so on.

Cost may drive the full extent of what you choose to build out. However, you should start with the following:

Redundant power supplies (and UPSs)
Redundant fan systems
Fault-tolerant disks, such as RAID (levels 1 through 10), preferably “hot swappable”
ECC memory
Redundant Ethernet connections
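The payoff from the redundant items in this list can be estimated with a simple model. The sketch below assumes that the redundant units fail independently and that a single surviving unit is enough to keep the server running; the function name and the 99% figure are hypothetical.

```python
def redundant_availability(unit_availability, units):
    """Return the availability of a group of redundant, interchangeable units.

    The group is down only if every unit is down at the same time.
    Assumes independent failures; unit_availability is a fraction (0 to 1).
    """
    if not 0.0 <= unit_availability <= 1.0:
        raise ValueError("unit_availability must be between 0 and 1")
    if units < 1:
        raise ValueError("units must be at least 1")
    return 1.0 - (1.0 - unit_availability) ** units


single_supply = redundant_availability(0.99, 1)  # one power supply at 99%
dual_supply = redundant_availability(0.99, 2)    # two supplies: about 99.99%
```

Real failures are often correlated (for example, both power supplies fed from the same circuit), so a figure like this is best read as an upper bound rather than a prediction.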

Backup Considerations

After you consider hardware, you need to look at the basic techniques and frequency of your disk backups and database backups. For many companies, the backup plan isn’t what it needs to be to guarantee recoverability and even the basic level of high availability. At many sites, database backups are not being run, are corrupted, or aren’t even considered necessary. You would be shocked by the list of Fortune 1000 companies where this occurs.

Operating System Upgrades

You need to make sure that all upgrades to your OS are applied and also that the configuration of all options is correct. This includes making sure you have antivirus software installed (if applicable), along with the appropriate firewalls for external-facing systems.

Vendor Agreements Followed

Vendor agreements come in the form of software licenses, software support agreements, hardware service agreements, and both hardware and software service-level agreements. Essentially, you want to be able to get all software upgrades and patches for your OS and your application software at any time, and to have software support, hardware support agreements, and both software and hardware SLAs in place that guarantee a level of service within a defined period of time.

Training Kept Up to Date

Training is multifaceted: it can be aimed at software developers, to help ensure that the code they write is optimal; at system administrators, who need to administer applications; and even at end users, to make sure they use the system correctly. All these types of training play into the ultimate goal of achieving high availability.

Quality Assurance Done Well

Testing as much as possible—and doing it in a very formal way—is a great way to guarantee a system’s availability. Dozens of studies over the years have clearly shown that the more thoroughly you test (and the more formal your QA procedures), the fewer software problems you will have. Many companies foolishly skimp on testing, which has a huge impact on system reliability and availability.

Standards/Procedures Followed

Standards and procedures are interlaced tightly with training and QA. Coding standards, code walkthroughs, naming standards, formal system development life cycles, protection of tables from being dropped, use of governors, and so on all contribute to more stable and potentially more highly available systems.

Server Instance Isolation

By design, you may want to isolate applications (such as SQL Server’s applications and their databases) from each other to mitigate the risk of one application causing another to fail.

Plain and simple, you should never put applications in each other’s way if you don’t have to. The only things that might force you to load up a single server with all your applications would be expensive licensing costs for each server’s software and perhaps hardware scarcity (strict limitations to the number of servers available for all applications). A classic example occurs when a company loads up a single SQL Server instance with between two and eight applications and their associated databases. The problem is that the applications are sharing memory, CPUs, and internal work areas, such as tempdb. Figure 2 shows an overloaded SQL Server instance that is being asked to service seven major applications (Appl 1 DB through Appl 7 DB).

Figure 2. High risk: Many applications sharing a single SQL Server 2008 instance.

The single SQL Server instance in Figure 2 is sharing memory (cache) and critical internal working areas, such as tempdb, with all seven major applications. Everything runs fine until one of these applications submits a runaway query, and all other applications being serviced by that SQL Server instance come to a grinding halt. Most of this built-in risk could be avoided by simply putting each application (or perhaps two applications) onto its own SQL Server instance, as shown in Figure 3. This fundamental design approach greatly reduces the risk of one application affecting another.

Figure 3. Mitigated risk: Isolating critical applications away from each other.

Many companies make this fundamental error. The trouble is that they keep adding new applications to their existing server instance without a full understanding of the shared resources that underpin the environment. It is often too late when they finally realize that they are hurting themselves “by design.” You have now been given proper warning of the risks. If other factors, such as cost or hardware availability, dictate otherwise, then at least it is a calculated risk that is entered into knowingly (and is properly documented as well).
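The difference between the shared and isolated layouts discussed in this section can be illustrated with a toy model of which applications are exposed when one of them submits a runaway query. The instance names (INST1 and so on) and the function name are hypothetical; the application names follow Figure 2. The model assumes that a runaway query affects every application on the same instance and nothing outside it.

```python
def affected_applications(placement, runaway_app):
    """Return the applications sharing an instance with runaway_app.

    placement maps each application name to the instance that hosts it.
    The result includes runaway_app itself.
    """
    instance = placement[runaway_app]
    return sorted(app for app, host in placement.items() if host == instance)


apps = [f"Appl {n} DB" for n in range(1, 8)]

# Layout in the style of Figure 2: all seven applications on one instance.
shared = {app: "INST1" for app in apps}

# Layout in the style of Figure 3: one instance per application.
isolated = {app: f"INST{n}" for n, app in enumerate(apps, start=1)}

shared_impact = affected_applications(shared, "Appl 3 DB")      # all seven
isolated_impact = affected_applications(isolated, "Appl 3 DB")  # only itself
```

Documenting a placement map like this alongside the reasons for any sharing is one way to record the calculated risk mentioned above.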